feat: Add a parquet uuid calculation #3440
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
@NJManganelli - what is the status of this PR? Are you still working on it? Thanks!
Hi @ianna, I'll add a test; then I think it'll be ready from my side.
Not without a performance penalty, but if it needs to be optimized, we could figure out a smarter but still sufficient calculation (I'd like to ensure that any changes in compression, columns, or rows are captured). It's also possible that I didn't explore enough of the checksum information that's supposed to be available (but I think that was at the page level or something, and just the loop over all those pages seems like it would be much worse than this).

Without uuid:

python -m timeit -n 1000 "import test_3440_calculate_parquet_uuid; test_3440_calculate_parquet_uuid.test_parquet_uuid()"
1000 loops, best of 5: 83 usec per loop

With uuid:
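For reference, the shell benchmark quoted above can be reproduced from inside Python with the stdlib `timeit` module. The stub function below is a hypothetical placeholder; the real measurement imports the project's `test_3440_calculate_parquet_uuid` module instead.

```python
import timeit

# Hypothetical stand-in for the benchmarked call; the real invocation runs
# test_3440_calculate_parquet_uuid.test_parquet_uuid() instead.
def read_metadata_stub():
    return {"num_rows": 5, "num_row_groups": 1}

# Equivalent of `python -m timeit -n 1000 "..."`: 1000 calls per repeat,
# best of 5 repeats, reported as time per call.
best = min(timeit.repeat(read_metadata_stub, number=1000, repeat=5))
print(f"1000 loops, best of 5: {best / 1000 * 1e6:.2f} usec per loop")
```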
Compare 5000ce8 to eab037d: …e first and last row_groups plus the col_counts of all row_groups of the file or dataset
Rebased for my own sanity, and marked ready (presuming all the tests are going to pass; I'll fix it otherwise).
@NJManganelli - it looks like the uuids do not match:

______________________________ test_parquet_uuid _______________________________

    def test_parquet_uuid():
        meta = metadata_from_parquet(input)
>       assert (
            meta["uuid"]
            == "93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae"
        )
E       AssertionError: assert 'adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0' == '93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae'
E
E       - 93fd596534251b1ac750f359d9d55418b02bcddcc79cd5ab1e3d9735941fbcae
E       + adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0

meta = {'col_counts': [5],
 'columns': ['u1', 'u4', 'u8', 'f4', 'f8', 'raw', 'utf8'],
 'form': RecordForm([BitMaskedForm('u8', NumpyForm('bool'), True, True), BitMaskedForm('u8', NumpyForm('int32'), True, True), BitMaskedForm('u8', NumpyForm('int64'), True, True), BitMaskedForm('u8', NumpyForm('float32'), True, True), BitMaskedForm('u8', NumpyForm('float64'), True, True), BitMaskedForm('u8', ListOffsetForm('i32', NumpyForm('uint8', parameters={'__array__': 'byte'}), parameters={'__array__': 'bytestring'}), True, True), BitMaskedForm('u8', ListOffsetForm('i32', NumpyForm('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'}), True, True)], ['u1', 'u4', 'u8', 'f4', 'f8', 'raw', 'utf8']),
 'fs': <fsspec.implementations.local.LocalFileSystem object at 0x7ff63efc9340>,
 'num_row_groups': 1,
 'num_rows': 5,
 'paths': ['/home/runner/work/awkward/awkward/tests/samples/nullable-record-primitives.parquet'],
 'uuid': 'adc236484e10384ad680a1ae2ce1fc675f6f9f0a84a02c651ed91ad97fba41c0'}

tests/test_3440_calculate_parquet_uuid.py:22: AssertionError
Aye, judging from that printout, this will need to be more selective about what goes into the hash. I'll have a look when I am back from holidays.
…ulnerable to OS-specific effects
… for OS-agnostic uuid
@ianna I'm trying a more selective set of key-value pairs, hoping it'll be more stable, but "it works on my machine" just as the previous one did, so we'll need to see what the tests say, I think.
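A minimal sketch of that idea, with key names taken from the failing test's printout (the helper itself is hypothetical, not the PR's code): hash only machine-independent fields, deliberately skipping runner-specific ones like `paths` (absolute paths) and `fs` (a live filesystem object), which would make the digest differ across machines.

```python
import hashlib
import json

def stable_uuid(meta: dict) -> str:
    # Hypothetical sketch: hash only machine-independent metadata.
    # 'fs' and 'paths' are excluded on purpose, since they vary per runner.
    stable_keys = ("num_rows", "num_row_groups", "columns", "col_counts")
    payload = json.dumps({k: meta[k] for k in stable_keys}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

meta = {
    "num_rows": 5,
    "num_row_groups": 1,
    "columns": ["u1", "u4", "u8", "f4", "f8", "raw", "utf8"],
    "col_counts": [5],
    "paths": ["/home/runner/..."],  # deliberately ignored by the hash
}
print(stable_uuid(meta))  # 64 hex chars, identical on any machine
```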
Calculate a uuid from parquet metadata, utilizing detailed info from the first and last row_groups plus the col_counts of all row_groups of the file or dataset. At the column-page level, parquet should have a checksum AFAIK, but an approximate calculation that deterministically uses two row groups and catches differences in the number of rows, row groups, columns, compression, etc. should be sufficient for the equivalent of what coffea does with root files (which is to flag them when they change, so that the form, steps, etc. are recalculated).
https://github.com/scikit-hep/coffea/blob/master/src/coffea/dataset_tools/preprocess.py#L46-L48
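A stdlib-only sketch of the calculation described above (the row-group dicts and their keys are illustrative assumptions, not awkward's actual metadata layout): fold detailed info from the first and last row groups, plus every row group's column count, into one SHA-256 digest.

```python
import hashlib
import json

def dataset_uuid(row_groups):
    # Hypothetical sketch of the approach described above: hash detailed
    # info of the first and last row groups plus col_counts of all of them.
    h = hashlib.sha256()
    # Detailed info from only the boundary row groups keeps the cost O(1)
    # in the number of row groups.
    for rg in (row_groups[0], row_groups[-1]):
        h.update(json.dumps(rg, sort_keys=True).encode("utf-8"))
    # Column counts from every row group catch edits in the middle.
    col_counts = [rg["num_columns"] for rg in row_groups]
    h.update(json.dumps(col_counts).encode("utf-8"))
    return h.hexdigest()

row_groups = [
    {"num_rows": 5, "num_columns": 7, "total_byte_size": 1234, "compression": "SNAPPY"},
    {"num_rows": 3, "num_columns": 7, "total_byte_size": 987, "compression": "SNAPPY"},
]
print(dataset_uuid(row_groups))
```

Changing any hashed field (e.g. the compression codec of a boundary row group) produces a different digest, which is enough to flag the dataset for reprocessing.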
Also, the ParquetMetadata namedtuple doesn't appear to be used, at least in the file this PR touches. Given that an extra line is needed to avoid changing the length of the returned tuple (so as not to break compatibility for outside users), maybe that return style should be deprecated and the namedtuple used instead?